Feat/close preconfirmed on graceful shutdown #859
base: main
Conversation
```rust
#[tokio::test]
async fn test_graceful_shutdown_closes_preconfirmed_block(
    #[future]
    #[with(Duration::from_secs(3000000000), false)]
```
This seems too big? Should we not reduce it?
```rust
task.await.unwrap();

// Give a small delay to ensure database writes complete
tokio::time::sleep(Duration::from_millis(100)).await;
```
Why is this needed, though? Aren't all services (including the DB) closed once we shut down?
```rust
// Graceful shutdown flow:
//
// ┌─────────────────────────────────────────────────────────────────┐
// │ Graceful Shutdown Flow │
```
can format this
```rust
#[rstest::rstest]
#[timeout(Duration::from_secs(30))]
#[tokio::test]
async fn test_graceful_shutdown_closes_preconfirmed_block(
```
Is it possible to add some more test cases for edge cases?
```rust
if let Some(deadline) = shutdown_state.end_block_deadline.as_ref() {
    tokio::time::sleep_until(*deadline).await
} else {
    std::future::pending().await
}
```
I think we can remove the if-else statement and directly sleep until the deadline.
apoorvsadana left a comment
As discussed, let's try an approach where
- we pass cancellation ctx to executor
- the block production service waits for the batcher, the executor, and for the block to be closed
```rust
// Executor channel is closed - executor has already shut down
// This can happen if executor detected channel closure before we sent CloseBlock
// In this case, the block will remain preconfirmed and be handled on restart
```
Is this possible? If yes, we should fix it as part of this PR; we shouldn't have a known race condition. If not, this comment probably needs to change.
```rust
// 1. Cancellation Signal (ctx.cancel_global())
//    │
//    ├─> Batcher detects cancellation → exits → closes send_batch channel
//    │
//    └─> Main loop detects cancellation (via ctx.cancelled())
//        │
//        ├─> Check: Is there an open preconfirmed block?
//        │   │
//        │   ├─> YES: Send CloseBlock command to executor
//        │   │   │
//        │   │   ├─> Executor receives CloseBlock → sets force_close = true
//        │   │   │   │
//        │   │   │   └─> On next iteration, executor checks force_close
//        │   │   │       │
//        │   │   │       └─> Calls finalize() → sends EndBlock message
//        │   │   │
//        │   │   └─> Main loop waits for EndBlock (with timeout)
//        │   │       │
//        │   │       ├─> EndBlock received → process_reply() closes block
//        │   │       │   └─> Return (shutdown complete)
//        │   │       │
//        │   │       └─> Timeout/Error → block remains preconfirmed (handled on restart)
//        │   │
//        │   └─> NO: Continue to wait for batcher completion
//        │
// 2. Batcher Completion (alternative path if cancellation not detected first)
//    │
//    └─> Check: Is there an open preconfirmed block?
//        │
//        └─> Same flow as above (send CloseBlock → wait for EndBlock)
//
// 3. Executor Shutdown
//    │
//    └─> Executor detects send_batch channel closure → exits gracefully
//        └─> Signals via executor.stop channel
//
// Key Safety Features:
// - Timeout protection: Won't wait indefinitely for EndBlock
// - State validation: Only sends CloseBlock when block exists
// - Graceful degradation: If executor already shut down, block handled on restart
// - No re-execution: Uses executor's existing state (no transaction re-execution needed)
//
// Note: The executor thread shuts down, dropping the `executor.stop` channel, therefore closing it as well.
```
I think these comments are more detailed than they need to be. I'm worried they'll get outdated very easily: someone will update the executor thread and forget to change this comment because it's in a completely different place. Maybe this level of detail makes more sense in the PR description.
```rust
// - Executor crashes before sending EndBlock
// - EndBlock gets lost in transit
// - Any other unexpected failure
_ = async {
```
doesn't the services layer already have a deadline after which it forcefully kills the service?
```rust
// 4. If CloseBlock succeeds → wait for EndBlock
// 5. If CloseBlock fails or no block → shutdown complete
res = &mut batcher_task, if !shutdown_state.batcher_completed => {
    res.context("In batcher task")?;
```
If the batcher task closes because of an error, then `?` will technically return before we gracefully shut down?
comment still applies?
```rust
self.process_reply(reply).await.context("Processing reply from executor thread")?;

// If we're shutting down and just processed EndBlock, shutdown is complete
if shutting_down && is_end_block {
```
It's possible the batcher hasn't closed yet, right? Imagine a scenario where the batcher didn't detect the signal yet, shutting_down has been made true, and the executor closed the block because block time was reached.
```rust
if shutting_down {
    // Executor exited during shutdown
    // If there was a block, executor should have sent EndBlock before exiting
    // If we're here, either no block existed or executor panicked
    // In case of panic, it will propagate naturally
    tracing::debug!("Executor shut down during graceful shutdown");
    return Ok(());
}
// Normal executor completion (not during shutdown)
// If executor panicked, recv() will resume the panic (handled by StopErrorReceiver)
res.context("In executor thread")?;
return Ok(());
```
Can't we just check if res is an error, and if yes add context, otherwise not? Or maybe we add context all the time. Unsure what this if condition is solving here.
```rust
#[rstest::rstest]
#[timeout(Duration::from_secs(30))]
#[tokio::test]
async fn test_graceful_shutdown_closes_preconfirmed_block_with_multiple_transactions(
```
Let's combine this test and the one-transaction test? I think having multiple cases like these makes more sense when we have smaller unit tests, not in these ones (or the ones we had in cairo native).
```rust
// This test verifies that graceful shutdown completes successfully even in edge cases.
// With the new implementation, the executor automatically closes blocks when it detects
// the send_batch channel closure (WaitTxBatchOutcome::Exit). This test ensures shutdown
// completes gracefully regardless of timing.
//
// The scenario: Cancellation → batcher completes → closes send_batch → executor detects Exit
// → executor closes block automatically → sends EndBlock → main loop closes block → shutdown complete
#[rstest::rstest]
#[timeout(Duration::from_secs(30))]
#[tokio::test]
async fn test_graceful_shutdown_closeblock_fails_executor_channel_closed(
    #[future]
```
What's the edge case here? Not sure how this is different from the previous test case.
Graceful Shutdown with Preconfirmed Block Closing

Summary

Implements graceful shutdown for the block production service that properly closes any open preconfirmed block during shutdown, using the executor's existing state without re-execution.

Problem

Previously, when graceful shutdown was triggered, the block production service would exit immediately, leaving any open preconfirmed block unclosed. This required re-execution of transactions on restart, which is inefficient and can cause inconsistencies.

Solution

The implementation adds a graceful shutdown flow that:
- Sends a `CloseBlock` command to the executor thread (which is still running)
- Waits for the `EndBlock` message from the executor with the block's execution summary

Implementation Details

Shutdown Flow

The shutdown process follows 5 concurrent paths in a `tokio::select!` loop; one of these paths waits for the `EndBlock` message from the executor.

State Management

- `ShutdownState`: encapsulates shutdown state (`shutting_down`, `batcher_completed`, `end_block_deadline`)
- `try_close_block_on_shutdown()`: attempts to close the block and returns `ControlFlow` for loop control

Note

If graceful shutdown fails for any reason, the block remains preconfirmed and is handled on restart via the existing `close_preconfirmed_block_if_exists()` mechanism.